© Zühlke APAC SWEX+DX 2024
Prep.: Create Index in AI Search (Vector DB)
> An embedding is a vector (list) of floating point numbers.
> The distance between two vectors measures their relatedness. Small distances suggest high relatedness and
> large distances suggest low relatedness.
> -- OpenAI
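To make the idea of "distance" concrete, here is a minimal sketch using cosine distance, one common choice of metric. The three-dimensional vectors and their labels are invented for illustration; real embeddings from the model used in this workshop have 1536 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Smaller distance suggests higher relatedness."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical toy "embeddings", not produced by any model
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_distance(cat, kitten))   # small: related concepts
print(cosine_distance(cat, invoice))  # large: unrelated concepts
```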
instance: srch-swex-rag-workshop - Indexes and verify your index is created
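# Imports needed by the index definition
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)

# Define the index fields
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="filename", type=SearchFieldDataType.String, filterable=True, sortable=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        hidden=False,
        searchable=True,
        vector_search_dimensions=1536,  # must match the embedding model's output size
        vector_search_profile_name="default",  # name of the VectorSearchProfile defined below
    ),
]
# Vector search configuration: one profile backed by an HNSW algorithm with default settings
vector_search = VectorSearch(
    profiles=[VectorSearchProfile(name="default", algorithm_configuration_name="default")],
    algorithms=[HnswAlgorithmConfiguration(name="default")],
)
index = SearchIndex(
    name=AI_SEARCH_INDEX_NAME,
    fields=fields,
    vector_search=vector_search,
)
# Create the index (the `await` assumes an async SearchIndexClient and an async context such as a notebook)
search_index = await ai_search_index_client.create_index(index)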
{
  "id": "0d1f45b1-c19a-4254-9390-5bd8a3b94c95",
  "filename": "document.pdf",
  "content": "This is the content of the document",
  "embedding": [
    0.00838001,
    ...
  ]
}
from Wiki - Workshop Document - Input Document as input document
Implement a Python function that loads a local document and extracts its text content.
Use the library: pypdf~=5.0.0 for PDF files.
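One possible shape for such a function is sketched below, not a reference solution. The name `load_document` and the fallback to plain UTF-8 text for non-PDF files are assumptions made for illustration; only the PDF branch uses pypdf (`PdfReader` and `extract_text`).

```python
from pathlib import Path


def load_document(path: str) -> str:
    """Return the text content of a local file (hypothetical helper name)."""
    file = Path(path)
    if file.suffix.lower() == ".pdf":
        # Imported lazily so plain-text files can be loaded without pypdf installed
        from pypdf import PdfReader

        reader = PdfReader(str(file))
        # extract_text() may yield nothing for image-only pages, hence the `or ""`
        pages = [page.extract_text() or "" for page in reader.pages]
        return "\n".join(pages).strip()
    # Assumption: anything that is not a PDF is treated as UTF-8 text
    return file.read_text(encoding="utf-8").strip()
```

Scanned PDFs without a text layer will come back empty with this approach; handling them would need OCR, which is outside the scope of this sketch.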
Implement a Python function to chunk the document content into parts based on tokens (e.g., 500 tokens per
chunk). Ensure each chunk is self-contained without breaking sentences.
Use the library: nltk~=3.9.1
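A minimal sketch of sentence-aware chunking follows. It assumes that a whitespace word count is an acceptable stand-in for model tokens; the helper names (`count_words`, `chunk_sentences`, `chunk_text`) are illustrative, and a model tokenizer could be passed in as `count_tokens` instead. Sentence splitting uses nltk's `sent_tokenize`, which needs the NLTK sentence tokenizer data to be downloaded first.

```python
from typing import Callable, Iterable


def count_words(text: str) -> int:
    """Rough token count: whitespace-separated words (an approximation, not a model tokenizer)."""
    return len(text.split())


def chunk_sentences(
    sentences: Iterable[str],
    max_tokens: int = 500,
    count_tokens: Callable[[str], int] = count_words,
) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk if adding this sentence would exceed the budget
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into sentences with nltk, then pack them into chunks."""
    # Imported lazily; requires the NLTK sentence tokenizer data to be installed
    from nltk.tokenize import sent_tokenize

    return chunk_sentences(sent_tokenize(text), max_tokens)
```

Because sentences are never split, a single sentence longer than `max_tokens` ends up as its own oversized chunk.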
Implement a Python function to generate embeddings for individual or multiple chunks using an Azure OpenAI
model.
Use the library: openai~=1.46.1 or azure-ai-inference==1.0.0b5
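The sketch below shows one way to do this with the `openai` package. The environment variable names, the `make_client` and `embed_chunks` helper names, and the batch size are assumptions; the deployment name must be that of an embedding model whose output size matches the 1536 dimensions configured in the index.

```python
import os


def make_client():
    """Build an Azure OpenAI client from environment variables (names are assumptions)."""
    from openai import AzureOpenAI  # imported lazily so the helpers below stay importable

    return AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    )


def embed_chunks(client, deployment: str, chunks, batch_size: int = 16) -> list[list[float]]:
    """Return one embedding per chunk; accepts a single string or a list of strings."""
    if isinstance(chunks, str):
        chunks = [chunks]
    embeddings: list[list[float]] = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        response = client.embeddings.create(model=deployment, input=batch)
        # Sort by index so the output order matches the input order
        for item in sorted(response.data, key=lambda d: d.index):
            embeddings.append(item.embedding)
    return embeddings
```

Usage would look like `embed_chunks(make_client(), "<your-embedding-deployment>", chunks)`, where the deployment name is a placeholder.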
to srch-swex-rag-workshop - Indexes and search your index
Help me write a Python function that uploads objects, each containing a text chunk and its embedding, to an
Azure AI Search index.
Use the library: azure-search-documents~=11.5.1
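A possible answer is sketched below, assuming the index schema shown earlier (`id`, `filename`, `content`, `embedding`) and an async `SearchClient` from `azure.search.documents.aio` that has already been created for the index. The helper names `build_documents` and `upload_to_index` are illustrative.

```python
import uuid


def build_documents(filename: str, chunks: list[str], embeddings: list[list[float]]) -> list[dict]:
    """Pair each chunk with its embedding in the shape expected by the index schema."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        {
            "id": str(uuid.uuid4()),  # random key; the index only requires it to be unique
            "filename": filename,
            "content": chunk,
            "embedding": embedding,
        }
        for chunk, embedding in zip(chunks, embeddings)
    ]


async def upload_to_index(search_client, documents: list[dict]) -> int:
    """Upload documents and return how many were indexed.

    search_client is assumed to be an azure.search.documents.aio.SearchClient
    bound to the target index.
    """
    results = await search_client.upload_documents(documents=documents)
    failed = [r.key for r in results if not r.succeeded]
    if failed:
        raise RuntimeError(f"Failed to index documents: {failed}")
    return len(results)
```

Random UUID keys mean that uploading the same file twice creates duplicate entries; deriving the key from the filename and chunk position would make uploads repeatable, at the cost of a slightly more involved key scheme.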